9 research outputs found

    Genetic Sequence Matching Using D4M Big Data Approaches

    Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M), an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and an Apache Accumulo database implementation to accelerate computations a hundredfold over other methods. Comparisons of the D4M method with the current gold standard for sequence analysis, BLAST, show that the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST. Comment: 6 pages; to appear in IEEE High Performance Extreme Computing (HPEC) 201

    Rapid Sequence Identification of Potential Pathogens Using Techniques from Sparse Linear Algebra

    The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing have rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we present D⁴RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear algebra and statistical properties to increase computational performance while retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield speed and precision tradeoffs, with applications in biodefense and medical diagnostics. The D⁴RAGenS analysis algorithm is tested over several datasets, including three utilized for the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest.

    A Linear Algebra Approach to Fast DNA Mixture Analysis Using GPUs

    Analysis of DNA samples is an important step in forensics, and the speed of analysis can impact investigations. Comparison of DNA sequences is based on the analysis of short tandem repeats (STRs), which are short DNA sequences of 2-5 base pairs. Current forensics approaches use 20 STR loci for analysis. The use of single nucleotide polymorphisms (SNPs) has utility for analysis of complex DNA mixtures. The use of tens of thousands of SNP loci for analysis poses significant computational challenges because the forensic analysis scales by the product of the loci count and the number of DNA samples to be analyzed. In this paper, we discuss the implementation of a DNA sequence comparison algorithm by re-casting the algorithm in terms of linear algebra primitives. By developing an overloaded matrix multiplication approach to DNA comparisons, we can leverage advances in GPU hardware and algorithms for Dense Generalized Matrix-Multiply (DGEMM) to speed up DNA sample comparisons. We show that it is possible to compare 2048 unknown DNA samples with 20 million known samples in under 6 seconds using an NVIDIA K80 GPU. Comment: Accepted for publication at the 2017 IEEE High Performance Extreme Computing conference
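
The abstract above does not spell out its encoding or its overloaded operator, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a hypothetical encoding in which each profile is a list of 0/1 flags per SNP locus (1 meaning the minor allele was observed), and it swaps the multiply step of a matrix product for "a AND NOT b". The function name `overloaded_matmul` is illustrative and is not taken from the paper.

```python
# Hypothetical encoding (an assumption, not from the paper): each profile is a
# list of 0/1 flags, one per SNP locus, where 1 means the minor allele is present.

def overloaded_matmul(references, mixtures):
    """Return counts[i][j]: the number of loci where reference i carries a
    minor allele that mixture j lacks.

    The loop structure mirrors C = A @ B^T, but the scalar multiply is
    replaced by 'a AND NOT b' before the usual summation over loci.
    """
    counts = []
    for ref in references:
        row = []
        for mix in mixtures:
            if len(ref) != len(mix):
                raise ValueError("profiles must cover the same loci")
            # a & (1 - b) is 1 only when the reference has the allele
            # and the mixture does not.
            row.append(sum(a & (1 - b) for a, b in zip(ref, mix)))
        counts.append(row)
    return counts


if __name__ == "__main__":
    references = [[1, 0, 1, 0], [0, 1, 1, 0]]
    mixtures = [[1, 1, 1, 0], [0, 0, 1, 1]]
    print(overloaded_matmul(references, mixtures))
```

Because a AND NOT b equals a - a*b for 0/1 values, the same counts could in principle be obtained from an ordinary matrix product of the reference matrix with the complemented mixture matrix, which is the kind of operation GPU GEMM routines accelerate.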

    Construction of an ~700-kb transcript map around the Familial Mediterranean Fever locus on human chromosome 16p13.3

    We used a combination of cDNA selection, exon amplification, and computational prediction from genomic sequence to isolate transcribed sequences from genomic DNA surrounding the familial Mediterranean fever (FMF) locus. Eighty-seven kb of genomic DNA around D16S3370, a marker showing a high degree of linkage disequilibrium with FMF, was sequenced to completion, and the sequence annotated. A transcript map reflecting the minimal number of genes encoded within the ∼700 kb of genomic DNA surrounding the FMF locus was assembled. This map consists of 27 genes with discrete messages detectable on Northerns, in addition to three olfactory-receptor genes, a cluster of 18 tRNA genes, and two putative transcriptional units that have typical intron–exon splice junctions yet do not detect messages on Northerns. Four of the transcripts are identical to genes described previously, seven have been independently identified by the French FMF Consortium, and the others are novel. Six related zinc-finger genes, a cluster of tRNAs, and three olfactory receptors account for the majority of transcribed sequences isolated from a 315-kb FMF central region (between D16S468/D16S3070 and cosmid 377A12). Interspersed among them are several genes that may be important in inflammation. This transcript map not only has permitted the identification of the FMF gene (MEFV), but also has provided us with an opportunity to probe the structural and functional features of this region of chromosome 16.
    Michael Centola, Xiaoguang Chen, Raman Sood, Zuoming Deng, Ivona Aksentijevich, Trevor Blake, Darrell O. Ricke, Xiang Chen, Geryl Wood, Nurit Zaks, Neil Richards, David Krizman, Elizabeth Mansfield, Sinoula Apostolou, Jingmei Liu, Neta Shafran, Anil Vedula, Melanie Hamon, Andrea Cercek, Tanaz Kahan, Deborah Gumucio, David F. Callen, Robert I. Richards, Robert K. Moyzis, Norman A. Doggett, Francis S. Collins, P. Paul Liu, Nathan Fischel-Ghodsian and Daniel L. Kastner

    Two Different Antibody-Dependent Enhancement (ADE) Risks for SARS-CoV-2 Antibodies

    COVID-19 (SARS-CoV-2) disease severity and stages vary from asymptomatic, mild flu-like symptoms, moderate, severe, and critical to chronic disease. COVID-19 disease progression includes lymphopenia, elevated proinflammatory cytokines and chemokines, accumulation of macrophages and neutrophils in lungs, immune dysregulation, cytokine storms, acute respiratory distress syndrome (ARDS), etc. Development of vaccines for severe acute respiratory syndrome (SARS), Middle East Respiratory Syndrome coronavirus (MERS-CoV), and other coronaviruses has been difficult due to vaccine-induced enhanced disease responses in animal models. Multiple betacoronaviruses, including SARS-CoV-2 and SARS-CoV-1, expand cellular tropism by infecting some phagocytic cells (immature macrophages and dendritic cells) via antibody-bound Fc receptor uptake of virus. Antibody-dependent enhancement (ADE) may be involved in the clinical observation of increased severity of symptoms associated with early high levels of SARS-CoV-2 antibodies in patients. Infants with multisystem inflammatory syndrome in children (MIS-C) associated with COVID-19 may also have ADE caused by maternally acquired SARS-CoV-2 antibodies bound to mast cells. ADE risks associated with SARS-CoV-2 have implications for COVID-19 and MIS-C treatments, B-cell vaccines, SARS-CoV-2 antibody therapy, and convalescent plasma therapy for patients. SARS-CoV-2 antibodies bound to mast cells may be involved in MIS-C and multisystem inflammatory syndrome in adults (MIS-A) following initial COVID-19 infection. SARS-CoV-2 antibodies bound to Fc receptors on macrophages and mast cells may represent two different mechanisms for ADE in patients. These two different ADE risks have possible implications for SARS-CoV-2 B-cell vaccines for subsets of populations based on age, cross-reactive antibodies, variabilities in antibody levels over time, and pregnancy. These models place increased emphasis on the importance of developing safe SARS-CoV-2 T cell vaccines that are not dependent upon antibodies.

    In vivo Monitoring of Transcriptional Dynamics After Lower-Limb Muscle Injury Enables Quantitative Classification of Healing

    Traumatic lower-limb musculoskeletal injuries are pervasive amongst athletes and the military, and typically an individual returns to activity before fully healing, which increases the predisposition to additional injuries and chronic pain. Monitoring healing progression after a musculoskeletal injury typically involves different types of imaging, but these approaches suffer from several disadvantages. Isolating and profiling transcripts from the injured site would abrogate these shortcomings and provide enumerative insights into the regenerative potential of an individual’s muscle after injury. In this study, a traumatic injury was administered to a mouse model and healing progression was examined from 3 hours to 1 month using high-throughput RNA-Sequencing (RNA-Seq). Comprehensive dissection of the genome-wide datasets revealed the injured site to be a dynamic, heterogeneous environment composed of multiple cell types and thousands of genes undergoing significant expression changes in highly regulated networks. Four independent approaches were used to determine the set of genes, isoforms, and genetic pathways most characteristic of different time points post-injury, and two novel approaches were developed to classify injured tissues at different time points. These results highlight the possibility of quantitatively tracking healing progression in situ via transcript profiling using high-throughput sequencing.

    Early onset of industrial-era warming across the oceans and continents

    The evolution of industrial-era warming across the continents and oceans provides a context for future climate change and is important for determining climate sensitivity and the processes that control regional warming. Here we use post-AD 1500 palaeoclimate records to show that sustained industrial-era warming of the tropical oceans first developed during the mid-nineteenth century and was nearly synchronous with Northern Hemisphere continental warming. The early onset of sustained, significant warming in palaeoclimate records and model simulations suggests that greenhouse forcing of industrial-era warming commenced as early as the mid-nineteenth century and included an enhanced equatorial ocean response mechanism. The development of Southern Hemisphere warming is delayed in reconstructions, but this apparent delay is not reproduced in climate simulations. Our findings imply that instrumental records are too short to comprehensively assess anthropogenic climate change and that, in some regions, about 180 years of industrial-era warming has already caused surface temperatures to emerge above pre-industrial values, even when taking natural variability into account.